Abstract



The methods of alpha-stratified adaptive testing and constrained adaptive 
testing with shadow tests are combined. The advantages are twofold: First, 
application of the shadow test approach allows us to implement any type of 
constraint on item selection in alpha-stratified adaptive testing. Second, the 
result yields a simple set of constraints that can be used in any application of 
the shadow test approach to reduce overexposure and underexposure of the 
items in the pool. An example from the Law School Admission Test is used 
to demonstrate the advantages. 

Key words: alpha-stratification; computerized adaptive testing; item-exposure
control; content constraints; shadow test approach

Implementing Content Constraints in Alpha-Stratified
Adaptive Testing Using a Shadow Test Approach

Among the practical problems that have emerged since the first applications of computerized
adaptive testing (CAT) in real-life testing programs, the problems of item-exposure control
and content balancing are the most urgent. Adaptive tests that capitalize too much on the
presence of a few items in the pool and ignore the others are not only cost-ineffective but
also bound to run into security problems. Also, if adaptive test administrations show too
much variation in content, they are likely to violate important test specifications, and the
testing program loses its content validity.

Two promising procedures to deal with these problems are alpha-stratified adaptive
testing (Chang & Ying, 1999) and constrained adaptive testing with shadow tests (van
der Linden, 2000; van der Linden & Reese, 1998). The proposal of alpha-stratified
adaptive testing was prompted by the observation that in CAT with maximum-information
item selection (van der Linden & Pashley, 2000) the first items typically have high
local discrimination, whereas, because of relatively large errors in the $\theta$ estimate,
lower discrimination over a broader interval would be better (Chang & Ying, 1999).
Alpha-stratified adaptive testing forces the CAT algorithm to select items with lower
discrimination at the beginning of the test, saving the items with high discrimination for
the end of it.

Constrained adaptive testing with shadow tests is a general method to introduce
constraints on the item selection process. Though developed originally to implement
content constraints on item selection (van der Linden & Reese, 1998), the method is
capable of dealing with any type of constraint for which a computer algorithm is available.
Examples of constraints other than content constraints are response-time constraints to control for
differential speededness among examinees in adaptive testing (van der Linden, Scrams,
& Schnipke, 1999), constraints on the moments of the item-score distributions to equate
observed scores between adaptive tests or an adaptive and a paper-and-pencil test (van der
Linden, 2001), and constraints to select among dimensions in multidimensional adaptive
testing (Veldkamp & van der Linden, submitted).

This paper combines the two methods of adaptive testing. The combination turns 
out to have two advantages. The use of the shadow test allows us to implement virtually
any type of constraint on item selection in alpha-stratified adaptive testing. In addition,
the constraints needed to model alpha-stratified adaptive testing constitute a simple set of 
mathematical (in)equalities. This set can be used in any other application of the shadow 
test approach to reduce overexposure and underexposure of the items in the pool. 

Alpha-Stratified CAT 

The fact that highly discriminating items may be suboptimal in the presence of errors
in the estimates of $\theta$ has been ignored in much of the literature on CAT. Nevertheless,
the phenomenon was already known in classical test theory (CTT) under the name of the
"attenuation paradox", where it was shown that an increase in item-criterion correlation
may imply a paradoxical decrease in the predictive validity of the tests if the items are
unreliable. The analogy with the current problem becomes clear when noticing the relations
between item reliability (CTT) and item information (IRT) and between item validity
(CTT) and item-ability correlation (the item discrimination parameter in IRT) (Lord &
Novick, 1968, Section 16.5).

Using an item-selection algorithm in CAT that always picks items with maximum
discrimination at all $\theta$ estimates has in fact three disadvantages: (1) As already argued,
the choice is likely to be suboptimal at the beginning of the test, where the larger errors
in the estimates of $\theta$ occur; (2) When the $\theta$ estimate converges towards the end of the
test, selection with maximum discrimination becomes optimal, but then some of the best
items in the pool are likely to have already been used; (3) Selecting items with maximum
discrimination tends to capitalize on estimation errors in the discrimination parameter,
with potentially serious effects on the estimation of $\theta$ even for calibration samples of
moderate sizes (van der Linden & Glas, 2000).

In alpha-stratified adaptive testing, the item pool is stratified on the values of the
item discrimination parameter. Suppose that $R$ different strata are used, indexed by
$r = 1, \ldots, R$, where a lower value of $r$ indicates a stratum with lower values for
the discrimination parameter. Further, suppose that the test consists of $n$ items and that
$n_r$ items are selected from stratum $r$ ($\sum_r n_r = n$). The order of the strata from which
the items are selected is then $1, \ldots, R$. Within each stratum, the items are selected to have
the smallest distance between the value of their difficulty parameter, $b_i$, and the current
estimate of $\theta$.
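The stratification and within-stratum selection just described can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the pool is split into equally sized strata by sorting on the discrimination parameter, the pool size is divisible by the number of strata, and the function names `stratify`, `stratum_schedule`, and `pick_item` are hypothetical, not taken from the paper.

```python
def stratify(a_values, R):
    """Assign each item a stratum 1..R by rank on its discrimination value.

    Assumes len(a_values) is divisible by R (equal-size strata).
    """
    order = sorted(range(len(a_values)), key=lambda i: a_values[i])
    size = len(order) // R
    strata = [0] * len(a_values)
    for rank, i in enumerate(order):
        strata[i] = min(rank // size, R - 1) + 1
    return strata


def stratum_schedule(n_per_stratum):
    """Expand (n_1, ..., n_R) into the stratum that is active at each position."""
    return [r for r, n_r in enumerate(n_per_stratum, start=1) for _ in range(n_r)]


def pick_item(b, strata, administered, r, theta_hat):
    """Pick the unused item in stratum r whose difficulty is closest to theta_hat."""
    candidates = [i for i in range(len(b))
                  if strata[i] == r and i not in administered]
    return min(candidates, key=lambda i: abs(b[i] - theta_hat))
```

In a full adaptive test, `pick_item` would be called once per position, with `r` read from `stratum_schedule` and `theta_hat` updated after every response.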

Observe that the order in which the strata are used leads towards more uniform
exposure rates of the items, particularly if the strata in the item pool are chosen to have
equal size and $n_r = n/R$. Alpha-stratified adaptive testing thus has the potential of
more favorable item-exposure rates in combination with a statistically more natural item
selection criterion. This expectation has been confirmed in studies, for example, by Chang
and Ying (1999) and Parshall, Hogarty, and Kromrey (1999).

Though generally low and tending to uniformity, the exposure rates of the items
in alpha-stratified adaptive testing do not automatically meet a previously set upper bound.
An unfavorable combination of size of pool, distribution of the item parameter values,
number of strata, and test length may lead to higher than desirable exposure rates for
some of the items.

In practice, the principle of alpha-stratified adaptive testing can therefore be used
to increase the effectiveness of the Sympson-Hetter (1985) method of exposure control.
The success of the latter, which is further described below, also depends on the size and
composition of the pool. In addition, even for this method and a favorable pool of items,
no formal proof exists of the exposure rates converging to values below a previously set
bound for each item (see further below). In practice, however, with the possible exception
of an occasional item, the method has been shown to meet reasonable bounds for
reasonable item pools, especially if the version conditional on $\theta$ proposed by Stocking
and Lewis (1998, 2000) is applied.

Application of the principle of alpha-stratification improves the results of the
Sympson-Hetter method for two reasons: (1) The Sympson-Hetter method does not
address the problem of the large number of underused items in the pool, whereas
alpha-stratification does; (2) The Sympson-Hetter method removes from the pool all items
that are selected but not administered. As a result, in a typical application with the
maximum-information criterion, at the end of the test the number of highly discriminating
items left near the examinee's true value of $\theta$ may have been reduced by a factor of 3-5.
However, if the Sympson-Hetter method is applied in combination with alpha-stratified CAT,
all of the best items are still available when the last section of the test is reached.

Two remaining problems for alpha-stratified adaptive testing are how to stratify the
item pool and balance test content across examinees (Stocking, 1998). The first problem
is addressed in a companion paper (Chang & van der Linden, submitted), where the
technique of network-flow programming is used to assign items optimally to strata, the
objective being uniform distributions both of the discrimination parameter between strata
and the difficulty parameter within each stratum. The second problem is addressed in the
remainder of this paper.

Constrained CAT with Shadow Tests

The key idea underlying the shadow test approach is that items are not selected directly
from the pool but from a shadow test. Shadow tests are full-size tests, assembled prior
to the selection of each item in the adaptive test, that have the following properties: (1) they
contain all items already administered to the examinee; (2) they are optimal at the current
$\theta$ estimate of the examinee; and (3) they meet all specifications the adaptive test has to meet.
The item that is actually administered to the examinee is the one in the shadow test that has
not yet been administered and is optimal at the $\theta$ estimate. After the item is administered,
the remaining free items in the shadow test are returned to the pool, the $\theta$ estimate is
updated, and the procedure is repeated.
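The procedure in the preceding paragraph can be summarized as a short control loop. The sketch below is only a skeleton under stated assumptions: the callables `assemble`, `pick`, `respond`, and `update` are hypothetical placeholders for the test assembly algorithm, the item selection criterion, the examinee, and the ability estimator, respectively. None of these names comes from the paper.

```python
def shadow_test_cat(n, assemble, pick, respond, update, theta0=0.0):
    """Skeleton of adaptive testing with shadow tests.

    assemble(theta, administered) -> full-length test (list of item indices)
        that contains all administered items and meets all specifications.
    pick(free_items, theta)       -> the free item that is optimal at theta.
    respond(item)                 -> the examinee's scored response (0 or 1).
    update(items, responses)      -> the new ability estimate.
    """
    administered, responses, theta = [], [], theta0
    for _ in range(n):
        shadow = assemble(theta, administered)           # step 1: shadow test
        free = [i for i in shadow if i not in administered]
        item = pick(free, theta)                         # step 2: best free item
        administered.append(item)
        responses.append(respond(item))                  # step 3: administer
        theta = update(administered, responses)          # step 4: re-estimate
    return administered, theta
```

Because the shadow test is reassembled from scratch at every step, the free items that were not administered are implicitly returned to the pool.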

The only modification of the traditional CAT algorithm needed to execute a shadow
test approach is a call to a test assembly algorithm prior to the selection of the item.
Nevertheless, this modification guarantees two important features of the adaptive test.
First, because each shadow test meets all test specifications, the adaptive test always meets
all specifications. Second, because each shadow test is assembled to be optimal at the
current $\theta$ estimate, and each item actually administered is the one in the shadow test
optimal at the same estimate, the adaptive test converges to optimality at the true $\theta$ value
of the examinee. Observe that these features hold generally, that is, independent of the set
of test specifications and the criterion of optimality chosen. For a more complete
introduction to the shadow test approach, technical aspects of its implementation, and
applications to item pools from large-scale testing programs, see van der Linden (2000).

Though any test assembly algorithm or heuristic could be used, this paper focuses on
the class of algorithms based on a 0-1 linear programming (LP) or mixed integer programming (MIP)
approach to test assembly. Key in the approach is the definition of decision variables for
the selection of the items in the test. In 0-1 LP-based test assembly, typically variables
$x_i$ are defined to be equal to one if item $i$ is selected in the test and equal to zero if it is
not, where $i = 1, \ldots, I$ are the indices denoting the items in the pool. Constraints on
the item selection process are linear equalities and/or inequalities imposed on the values
of the decision variables. Content constraints mostly take one of two possible forms,
depending on whether the attributes of the items that need to be constrained are categorical
or quantitative. If the attributes are categorical (e.g., a content classification, learning
taxonomy, or behavioral description), the set of attributes introduces a partition of the item
pool that can be denoted as the class of sets $V_g$, $g = 1, \ldots, G$, and the constraints take the
form

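\[
\sum_{i \in V_g} x_i \lesseqgtr n_g, \quad g = 1, \ldots, G, \tag{1}
\]

where $\lesseqgtr$ indicates that a constraint may be an upper bound, a lower bound, or an equality.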

If the attributes are quantitative parameters $q_i$ (e.g., response times, word counts, item
information), each constraint takes the form

\[
\sum_{i=1}^{I} q_i x_i \lesseqgtr n_q. \tag{2}
\]

In addition, an objective function is defined on the variables that is maximized or
minimized during the item selection process. For example, if the objective is to maximize
Fisher's information in the test at the examinee's current estimate, $\hat{\theta}$, the objective function
is

\[
\max \sum_{i=1}^{I} I_i(\hat{\theta}) x_i, \tag{3}
\]

where $I_i(\hat{\theta})$ is the information in the response to item $i$ at $\hat{\theta}$.
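To make the roles of the decision variables, the categorical constraints in (1), and the objective in (3) concrete, the following toy sketch enumerates all candidate tests of a given length and keeps the best feasible one. Exhaustive enumeration is used here only as a stand-in for a 0-1 LP solver and is practical for a handful of items only; the function name `assemble_max_info`, the fixed test length `n`, and the `(lower, upper)` format of `bounds` are assumptions made for the illustration.

```python
from itertools import combinations


def assemble_max_info(info, category, n, bounds):
    """Choose n items with maximum total information subject to bounds on
    the number of items per content category.

    info[i]     : information of item i at the current ability estimate
    category[i] : content category of item i
    bounds      : {category: (lower, upper)} limits on the item counts
    """
    best, best_value = None, float("-inf")
    for subset in combinations(range(len(info)), n):   # each subset is one x vector
        counts = {}
        for i in subset:
            counts[category[i]] = counts.get(category[i], 0) + 1
        feasible = all(lo <= counts.get(g, 0) <= hi
                       for g, (lo, hi) in bounds.items())
        if feasible:
            value = sum(info[i] for i in subset)       # objective (3)
            if value > best_value:
                best, best_value = subset, value
    return best, best_value
```

For example, with six items in two categories and a requirement of one or two items per category, the sketch returns the two most informative items of the stronger category plus the best item of the other one.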

The model can be solved for optimal values of the decision variables using one of
the algorithms available in software packages for LP. The package used by the authors to
solve the examples later in this paper was CPLEX 6.6 (ILOG, 2000), one of the fastest
packages currently available to solve test assembly problems for item pools of the size
typically used in large-scale testing programs. For a review of the various test assembly
problems that can be solved using 0-1 LP and the technical details of their solutions, the
reader should refer to van der Linden (1998).



Modeling Alpha-Stratified CAT 



The item response theory (IRT) model used in the examples later in this paper was the
three-parameter logistic (3PL) model

\[
P_i(\theta) = \Pr\{U_i = 1\} = c_i + (1 - c_i)
\frac{\exp[a_i(\theta - b_i)]}{1 + \exp[a_i(\theta - b_i)]}, \tag{4}
\]

where $U_i$ is the response variable for item $i$, with $U_i = 1$ for a correct and $U_i = 0$ for an
incorrect response, $\theta \in \mathbb{R}$ is the ability of the examinee, and $a_i \in (0, \infty)$,
$b_i \in \mathbb{R}$, and $c_i \in [0, 1)$ are the discrimination, difficulty, and guessing parameter
for item $i$, respectively.
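As a small illustration, the response function in (4) and the corresponding item information can be computed as follows. The information formula is the standard one for the 3PL model, written here without a scaling constant to match (4); the function names are hypothetical.

```python
import math


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model in (4)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def info_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at theta:
    a^2 * (Q / P) * ((P - c) / (1 - c))^2, with Q = 1 - P.
    """
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2
```

With `c = 0` the information reduces to `a**2 * P * (1 - P)`, which peaks at `theta = b`; this is why maximum-information selection favors items with high discrimination and difficulty close to the current estimate.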

Let $i_k$ be the index of the item in the pool administered as the $k$th item in the adaptive
test ($k = 1, \ldots, n$). Assume that $k - 1$ items have already been administered and that
stratum $r$ is active when item $k$ is selected. The estimator of $\theta$ after $k - 1$ items is
denoted as $\hat{\theta}_{k-1}$. The shadow test assembled for the selection of the $k$th item is denoted as
$(i_1, \ldots, i_{k-1}, i'_k, \ldots, i'_n)$, where $C_{k-1} = \{i_1, \ldots, i_{k-1}\}$ is the set of items
already administered and $F_k = \{i'_k, \ldots, i'_n\}$ is the set of free items. With $Q_r$ denoting
the set of items in stratum $r$, the $k$th item is selected from the set $Q_r \cap F_k$.

In alpha-stratified adaptive testing the $k$th item is selected to have a value for the
difficulty parameter, $b_i$, closest to $\hat{\theta}_{k-1}$. Thus, a natural objective for the shadow test is to
select the set of $n_r$ items from $Q_r$ that have minimum distance to $\hat{\theta}_{k-1}$. This objective is
realized by requiring this set to have $b_i$ values in the interval $(\hat{\theta}_{k-1} - y, \hat{\theta}_{k-1} + y)$, where
$y$ is a nonnegative real-valued decision variable that is minimized.

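The model for the selection of the $k$th item becomes:

\[
\min y \tag{5}
\]

subject to

\[
(b_i - \hat{\theta}_{k-1}) x_i \leq y, \quad i \in Q_r, \tag{6}
\]
\[
(b_i - \hat{\theta}_{k-1}) x_i \geq -y, \quad i \in Q_r, \tag{7}
\]
\[
\sum_{i \in Q_r} x_i = n_r, \quad r = 1, \ldots, R, \tag{8}
\]
\[
\sum_{i \in C_{k-1}} x_i = k - 1, \tag{9}
\]
\[
\sum_{i \in V_g} x_i \lesseqgtr n_g, \quad g = 1, \ldots, G, \tag{10}
\]
\[
\sum_{i=1}^{I} q_i x_i \lesseqgtr n_h, \quad h = 1, \ldots, H, \tag{11}
\]
\[
y \geq 0, \tag{12}
\]
\[
x_i \in \{0, 1\}, \quad i = 1, \ldots, I. \tag{13}
\]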



The interval $(\hat{\theta}_{k-1} - y, \hat{\theta}_{k-1} + y)$ for the items in $Q_r$ is defined in (6)-(7), whereas the size
of the interval is minimized in (5). The constraints in (8) require the solution to have $n_r$
items from each stratum $r$. The decision variables of the items already selected are set to
one in (9). The constraints in (10)-(11) represent the sets of categorical and quantitative
content constraints to be imposed on the item selection process. Finally, in (12)-(13) the
ranges of possible values for the decision variables are defined.

The $k$th item selected in the adaptive test is

\[
i_k = \arg\min_{j \in Q_r \cap F_k} \left\{ \left| b_j - \hat{\theta}_{k-1} \right| \right\}. \tag{14}
\]
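The model in (5)-(13) and the selection rule in (14) can be illustrated on a toy pool. The sketch below replaces the MIP solver by exhaustive enumeration, which is feasible only for very small pools, and collapses the content constraints in (10)-(11) into a user-supplied predicate `feasible`. The function names, the dictionary format of `n_per_stratum`, and the assumption that the active stratum contributes at least one item are all choices made for the illustration.

```python
from itertools import combinations


def assemble_shadow_test(b, stratum, n_per_stratum, administered, r, theta_hat,
                         feasible=lambda test: True):
    """Brute-force version of (5)-(13).

    Among all tests that contain the administered items (9), draw n_s items
    from each stratum s (8), and pass the content check `feasible` (10)-(11),
    return the one minimizing y, the largest distance |b_i - theta_hat| over
    its items in the active stratum r (5)-(7).
    """
    n = sum(n_per_stratum.values())
    free_pool = [i for i in range(len(b)) if i not in administered]
    best, best_y = None, float("inf")
    for extra in combinations(free_pool, n - len(administered)):
        test = tuple(administered) + extra
        counts_ok = all(sum(stratum[i] == s for i in test) == n_s
                        for s, n_s in n_per_stratum.items())
        if not (counts_ok and feasible(test)):
            continue
        y = max(abs(b[i] - theta_hat) for i in test if stratum[i] == r)
        if y < best_y:
            best, best_y = test, y
    return best, best_y


def select_item(test, b, stratum, administered, r, theta_hat):
    """Rule (14): the free item in stratum r with b closest to theta_hat."""
    free = [i for i in test if i not in administered and stratum[i] == r]
    return min(free, key=lambda i: abs(b[i] - theta_hat))
```

A return value of `None` for the test signals that no feasible shadow test exists for the given pool and constraints.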



Modifications of Sympson-Hetter Method 

The Sympson-Hetter (1985) method of exposure control is based on a distinction between
the events of selecting item $i$ for administration from the pool and actually administering
the item. We denote these events as $S_i$ and $A_i$ and their probabilities as $P(S_i)$ and $P(A_i)$,
respectively. Because $A_i$ implies $S_i$, it holds that

\[
P(A_i) = P(A_i, S_i) = P(A_i \mid S_i) P(S_i). \tag{15}
\]

For a given CAT procedure it is thus possible to lower the exposure rate of item $i$, $P(A_i)$, relative
to $P(S_i)$ by choosing $P(A_i \mid S_i) < 1$. The idea can be implemented by ordering the items
according to their value for the item-selection criterion at $\hat{\theta}_{k-1}$, selecting the first item,
and conducting a probability experiment that determines with probability $P(A_i \mid S_i)$ whether
the item will be administered. If the item is not administered, it is removed from the
pool for the rest of the test, and the experiment is repeated for the next item on the list.
In principle, it may be necessary to run a long list of
experiments before an item is administered. Stocking and Lewis (1998) proposed an
equivalent probability experiment that picks one item for administration from a list of
fixed length, with probabilities whose sizes are relative to those of the control parameters.
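The sequential probability experiment described above can be sketched as follows. The function name `sympson_hetter_pick`, the dictionary `control` holding the values of $P(A_i \mid S_i)$, and the set `removed` are hypothetical; the sketch covers the basic sequential version only, not the fixed-length variant of Stocking and Lewis.

```python
import random


def sympson_hetter_pick(ordered_items, control, rng, removed):
    """Walk down the list of candidate items, best first.

    Each item i is administered with probability control[i]; an item that
    fails its experiment is added to `removed` and is no longer available
    to this examinee. Returns the administered item, or None if every
    candidate fails.
    """
    for i in ordered_items:
        if rng.random() < control[i]:
            return i
        removed.add(i)
    return None


# Usage: item 0 can never be administered, item 1 always is.
removed = set()
chosen = sympson_hetter_pick([0, 1], {0: 0.0, 1: 1.0}, random.Random(0), removed)
```

Passing the random number generator explicitly keeps simulated administrations reproducible.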

To keep $P(A_i)$ below a maximum rate $r_i$ selected by the test
administrator, an iterative series of simulation studies is run in which the probabilities
$P(S_i)$ and $P(A_i)$ are estimated and the values of the control parameters $P(A_i \mid S_i)$
are adjusted. Let $P^{(t)}(S_i)$ and $P^{(t)}(A_i)$ denote the probabilities at Step $t$. The values of
$P(A_i \mid S_i)$ for the next step are then adjusted by the following rule:

\[
P^{(t+1)}(A_i \mid S_i) =
\begin{cases}
1 & \text{if } P^{(t)}(A_i) \leq r_i, \\
r_i / P^{(t)}(S_i) & \text{if } P^{(t)}(A_i) > r_i.
\end{cases} \tag{16}
\]
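One adjustment step of the rule in (16) can be written as a single function. The name `adjust_control` and the list-based interface are assumptions made for the illustration; the estimated probabilities would come from the simulation run at Step $t$.

```python
def adjust_control(p_select, p_admin, r_max):
    """Apply rule (16) once to all items.

    p_select[i] : estimated P(S_i) at the current step
    p_admin[i]  : estimated P(A_i) at the current step
    r_max       : maximum exposure rate set by the test administrator
    Returns the control parameters P(A_i | S_i) for the next step.
    """
    # P(A_i) > r_max implies P(S_i) > 0, so the division is safe.
    return [1.0 if pa <= r_max else r_max / ps
            for ps, pa in zip(p_select, p_admin)]
```

For instance, an item selected for half of the examinees and administered to 40% of them would, with a maximum rate of .20, receive a control parameter of .20/.50 = .40 in the next step.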



Observe that the equality in (15) only holds within Step $t$, but that (16) is based on
the assumption of the same equality for the probabilities between steps. However, the
assumption is invalid; for example, the actual value of $P(A_i)$ depends not only on the
values of $P(A_j \mid S_j)$ and $P(S_j)$ in the previous step for item $j = i$, but also on those for
items $j \neq i$. For this reason, convergence of the adjustments to values below $r_i$ is not
guaranteed. However, as already noted, in practice for a reasonable CAT procedure and
item pool, the method shows convergence for nearly all of the items.

Two modifications of the Sympson-Hetter method are needed to apply the method
to alpha-stratified CAT implemented through the shadow test approach. First, the list
of items from which an item is picked for administration is now defined as the set of
free items in the shadow test, $F_k$, ordered by the distance of their value for $b_i$ to $\hat{\theta}_{k-1}$.
Second, because the Sympson-Hetter method removes all previously selected items not
administered from the pool, for a combination of a poorly designed pool, tight
sets of constraints in (10)-(11), and long adaptive tests with low maximum exposure rates
$r_i$, the model in (6)-(13) may not always have a solution towards the end of the test for
each examinee; that is, the test assembly problem may become infeasible. The problem
is fixed by storing all items that are selected but not administered in a separate set. Let
$R_{k-1}$ denote this set when $k - 1$ items have been administered. If infeasibility occurs when
assembling the shadow test for item $k$, set $R_{k-1}$ is added to the pool temporarily, and a
solution always exists.
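The second modification amounts to a simple fallback around the call to the test assembly algorithm. In the sketch below, `assemble` is a hypothetical callable that returns a shadow test or `None` when the problem is infeasible, and `pool` and `set_aside` are Python sets of item indices.

```python
def assemble_with_fallback(assemble, pool, set_aside):
    """Assemble a shadow test from the regular pool; if that is infeasible,
    retry with the items that were selected but not administered added
    back temporarily. The pool itself is left unchanged.
    """
    test = assemble(pool)
    if test is None:
        test = assemble(pool | set_aside)
    return test
```

Because the union is built only for the retry, the set-aside items stay out of the pool for later items unless infeasibility occurs again.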

Simulation Study 

A simulation study was conducted to assess the impact of the following choices both on
the statistical properties of the final estimator, $\hat{\theta}_n$, and the exposure rates of the items:

(1) Alpha-stratified CAT vs. maximum-information CAT; 

(2) CAT without vs. with content constraints on item selection; 

(3) CAT without vs. with Sympson-Hetter exposure control. 

All possible combinations of choices were examined. The total number of conditions 
in the study was thus equal to 8. 

Item Pool and Test Specifications

The item pool and test specifications were taken from the Law School Admission Test 
(LSAT). The item pool was a previous pool consisting of 753 items. In all, 65 categorical 
and quantitative constraints were needed to model the content specifications for the LSAT. 
The length of the adaptive test was set equal to 50 items, which is half the length of 
the current paper-and-pencil version of the LSAT. The right-hand side coefficients in the 
content constraints in (10)-(11) were reduced proportionally. 

The item pool was divided into $R = 5$ strata of equal size, with the 20% of the items
with the lowest value for the discrimination parameter in Stratum 1, the next 20% in
Stratum 2, etc. From each stratum, $n_r = 10$ items were selected for the adaptive tests.

Adaptive Tests

In the conditions with alpha-stratified CAT, a test assembly model with the objective
function in (5) and the associated constraints in (6)-(7) was used. For CAT with
maximum-information item selection, the objective function and constraints were replaced by
the objective function in (3). Maximum-information item selection was thus also
implemented through a shadow test approach. The conditions with the content constraints
were realized by adding the set of 65 constraints from the LSAT in (10)-(13) to the test
assembly model. Finally, the Sympson-Hetter method was used with the modifications
described in the previous section and, for all items, a target exposure rate of $r_i = .20$.

Adaptive test administrations were simulated for $\theta = -2.0, -1.5, \ldots, 2.0$, with 2500
replications for each $\theta$ value. The initial value of the $\theta$ estimate was set equal to 0.
The subsequent estimates were expected a posteriori (EAP) estimates with a noninformative prior.
The shadow tests were obtained through calls to the CPLEX 6.6 software referred to earlier.
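For readers unfamiliar with EAP estimation, the following sketch computes the posterior mean of $\theta$ on a grid of ability values. It assumes that the noninformative prior is uniform over the interval $[-4, 4]$ and that the 3PL model is used for the likelihood; the grid, the interval, and the function name `eap_estimate` are assumptions for the illustration and are not taken from the study.

```python
import math


def eap_estimate(responses, items, grid=None):
    """Posterior mean of theta under a uniform prior on a grid.

    responses : list of 0/1 scores
    items     : list of (a, b, c) parameter triples, one per response
    """
    if grid is None:
        grid = [-4.0 + 0.1 * j for j in range(81)]   # -4.0, -3.9, ..., 4.0
    weights = []
    for theta in grid:
        like = 1.0
        for u, (a, b, c) in zip(responses, items):
            p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
            like *= p if u == 1 else 1.0 - p
        weights.append(like)       # uniform prior: posterior is proportional to likelihood
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total
```

With no responses the estimate equals the midpoint of the grid, which matches the initial value of 0 used for the ability estimate in the simulations.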

Results 

The bias and MSE functions of the ability estimator in the two main types of CAT in
the study are displayed in Figures 1 and 2. Ideally, bias functions have negligibly small
values uniformly over $\theta$. This ideal was met for all functions in the conditions with
alpha-stratified CAT. The same holds for maximum-information CAT, with the exception of the
condition with Sympson-Hetter item-exposure control. In this case, after 20 items the
lower end of the ability scale showed a negative bias, with considerable size at $\theta = -2.0$.
However, after the full test of 50 items in this condition, bias was generally reduced to a
very low level.

[Figure 1-2 about here] 

All MSE functions in Figure 2 run horizontally, with the exception of those for
maximum-information CAT with Sympson-Hetter item-exposure control at n=20. This
exception reflects the bias component obtained for this condition already shown in
Figure 1. As expected, the MSE functions at n=50 items were much lower than those
at n=20. Also, the functions for maximum-information CAT were lower than those for 
alpha-stratified CAT. However, for n=50 items, both types of CAT showed satisfactory 
MSE. For the condition with alpha-stratified CAT at n=20, it should be noted that at
this stage only the first two strata, with the items with the lowest discrimination in the
pool, were covered. A genuine 20-item alpha-stratified CAT would have consisted of
five different strata of four items each. Thus, the relatively large MSE in this condition
should not come as a surprise.

Generally, imposing content constraints on an item selection process tends to produce
poorer ability estimates than unconstrained item selection from the same pool. However,
in spite of the large number of constraints, for both types of CAT hardly any increase
in MSE was observed. The most likely explanation for this phenomenon is the quality
of the item pool. The items in this pool were carefully written according to the content
specifications for the LSAT. Hence, the shadow test algorithm did not have to force item
selection much to meet the constraints.

[Figure 3 about here] 

In Figure 3, the empirical exposure rates of the items are presented in decreasing
order. For all conditions, the rates for alpha-stratified CAT were much more uniform than
those for maximum-information CAT. The addition of Sympson-Hetter item-exposure
control to the procedure had a favorable impact on maximum-information CAT, but the
resulting rates were still much less favorable than those for alpha-stratified CAT.

Discussion 

Large numbers of content constraints can easily be implemented in alpha-stratified CAT
through a shadow test approach. For a well-designed item pool, such as the one from
the LSAT in the empirical study, imposing content constraints on the item selection does
not need to have any disadvantageous impact on the statistical properties of the ability
estimator. Relative to maximum-information CAT, alpha-stratification tends to result in
much more favorable exposure rates for the items. The rates for the popular items are
likely to be reduced considerably and, equally important, those for the unpopular items to
go up to much more acceptable levels. The price to be paid for this result is a slight loss
in the accuracy of the estimator. However, from a practical point of view, this loss can be
compensated for by adding a few items to the test, whereas loss due to item compromise
or inefficient item use is more difficult to compensate.

Figure 1. Bias functions for alpha-stratified (bold lines) and maximum-information
CAT (thin lines) after n=20 (dashed lines) and n=50 items (solid lines) under the 
conditions with/without content constraints and with/without Sympson-Hetter item- 
exposure control. 

Figure 2. MSE functions for alpha-stratified (bold lines) and maximum-information 
CAT (thin lines) after n=20 (dashed lines) and n=50 items (solid lines) under the 
conditions with/without content constraints and with/without Sympson-Hetter item- 
exposure control. 

Figure 3. Item exposure rates for alpha-stratified (bold lines) and maximum-information
CAT (thin lines) under the conditions with/without content constraints and with/without
Sympson-Hetter item-exposure control.